
Assignment 4 | Tristan Amiotte-Suchet#8

Open
letriton25 wants to merge 7 commits into
AA-parallel-computing:mainfrom
letriton25:tristan.amiotte-suchet


Parallel Programming

Åbo Akademi University, Information Technology Department

Instructor: Alireza Olama

Student: Tristan Amiotte-Suchet

Student ID: 2501127

Homework Assignment 4: Optimizing Matrix Multiplication in C++

Due Date: 31/05/2026

Points: 100


Challenge of the Assignment

In this assignment, we are tasked with optimizing the performance of a naive matrix multiplication implementation in C++ using two techniques: cache optimization via blocked matrix multiplication and parallelization using OpenMP.

Before starting the optimizations, it is crucial to ensure that the naive matrix multiplication implementation is correct. Because all the benchmarks and tests rely on that correctness, validating the results against a reference implementation is essential. It is therefore important to start by implementing the validation functions, along with the other helpers used to read and write the matrix files.

Here is a short overview of this pre-optimization phase and of the steps that followed when implementing the different optimizations.
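As an illustration of the validation step, here is a minimal sketch of an element-wise comparison with a tolerance. The function name `matrices_match` and the default tolerances are assumptions made for this example; they are not taken from main_ans.cpp or from the assignment.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns true when every element of `result` is within
// abs_tol + rel_tol * |reference| of the matching element of `reference`.
// Both arrays hold `count` floats (for an m x p matrix, count = m * p).
bool matrices_match(const float *result, const float *reference, std::size_t count,
                    float abs_tol = 1e-4f, float rel_tol = 1e-4f)
{
    for (std::size_t idx = 0; idx < count; ++idx) {
        const float diff = std::fabs(result[idx] - reference[idx]);
        const float limit = abs_tol + rel_tol * std::fabs(reference[idx]);
        // Written as !(diff <= limit) so that a NaN in either input fails the check.
        if (!(diff <= limit)) {
            return false;
        }
    }
    return true;
}
```

A tolerance is used rather than exact equality because the optimized versions may add the products in a different order, which can change the last bits of a float result.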

Pre-Optimization Phase

I decided to implement the read/write code in a separate library called matfile to keep the code organized and modular. This also allowed me, later, to write a side program that generates random matrices with specific dimensions for testing, without duplicating the read/write code. As a result, my implementation in the main_ans.cpp file has the same structure as the one provided in main.cpp.

Cache Optimization (Blocked Matrix Multiplication)

This part is the one that gave me the most difficulty. At first, I implemented it using the pseudocode provided in the assignment instructions. However, I quickly ran into a performance issue: whatever block size I tried, there was no speedup on the provided matrices. Using my side executable to generate bigger matrices such as 1000x1000, I was finally able to measure a small speedup, generally between 1.4x and 1.6x. Even with bigger matrices, the gain was not significant. The lack of speedup with this first implementation may be due to my hardware, but I have not been able to identify the exact cause. The code used at that point was the following:

#include <algorithm> // std::min
#include <cstdint>   // uint32_t

void blocked_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t block_size)
{
    // A is m x n, B is n x p, C is m x p (row-major).
    // C must be zero-initialized by the caller: the loops only accumulate into it.
    // block_size is the edge length of the square tiles.
    for (uint32_t ii = 0; ii < m; ii += block_size) {
        for (uint32_t kk = 0; kk < n; kk += block_size) {
            for (uint32_t jj = 0; jj < p; jj += block_size) {

                // Clamp the tile bounds so partial tiles at the edges are handled.
                uint32_t i_end = std::min(ii + block_size, m);
                uint32_t k_end = std::min(kk + block_size, n);
                uint32_t j_end = std::min(jj + block_size, p);

                for (uint32_t i = ii; i < i_end; ++i) {
                    for (uint32_t k = kk; k < k_end; ++k) {

                        // A[i][k] is reused for the whole row segment of B.
                        float aik = A[i * n + k];

                        for (uint32_t j = jj; j < j_end; ++j) {
                            C[i * p + j] +=
                                aik * B[k * p + j];
                        }
                    }
                }
            }
        }
    }
}

After that, I decided to use a different approach. The concept is the same, and I still divide the matrices into blocks, but I also use a register blocking technique. The idea is to load the values of the output matrix (C) into registers, then perform the calculations for a small block of columns (e.g., 8 columns at a time) while keeping the intermediate results in registers. This reduces the number of memory accesses and takes advantage of the CPU's ability to perform multiple operations on data held in registers. With this new approach, I achieved a much better speedup, around 2.8x in general on the provided matrices. The code for this approach is the last committed version of the blocked_matmul function in main_ans.cpp.
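To make the register blocking idea concrete, here is a simplified sketch. It is not the version committed in main_ans.cpp: the function name `blocked_matmul_reg`, the constant `REG_COLS`, and the choice to block only along k are assumptions made for this illustration. Eight partial sums of one row of C are kept in a small local array, which the compiler can typically hold in registers, and are written back to memory once per k-block.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tile width: number of C columns accumulated together.
constexpr uint32_t REG_COLS = 8;

// A is m x n, B is n x p, C is m x p (row-major). C must be zero-initialized.
void blocked_matmul_reg(float *C, const float *A, const float *B,
                        uint32_t m, uint32_t n, uint32_t p, uint32_t block_size)
{
    for (uint32_t kk = 0; kk < n; kk += block_size) {
        const uint32_t k_end = std::min(kk + block_size, n);
        for (uint32_t i = 0; i < m; ++i) {
            uint32_t j = 0;
            // Full tiles of REG_COLS columns.
            for (; j + REG_COLS <= p; j += REG_COLS) {
                float acc[REG_COLS];
                for (uint32_t r = 0; r < REG_COLS; ++r) {
                    acc[r] = C[i * p + j + r]; // load current partial sums
                }
                for (uint32_t k = kk; k < k_end; ++k) {
                    const float aik = A[i * n + k];
                    for (uint32_t r = 0; r < REG_COLS; ++r) {
                        acc[r] += aik * B[k * p + j + r];
                    }
                }
                for (uint32_t r = 0; r < REG_COLS; ++r) {
                    C[i * p + j + r] = acc[r]; // one store per element per k-block
                }
            }
            // Remaining columns when p is not a multiple of REG_COLS.
            for (; j < p; ++j) {
                float acc = C[i * p + j];
                for (uint32_t k = kk; k < k_end; ++k) {
                    acc += A[i * n + k] * B[k * p + j];
                }
                C[i * p + j] = acc;
            }
        }
    }
}
```

Compared with the first blocked version, each element of C is read and written once per k-block instead of once per multiplication, which is where the reduction in memory accesses comes from.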

Parallel Matrix Multiplication using OpenMP

This one was more straightforward to implement. I just had to add the OpenMP pragma to the naive matrix multiplication implementation, as explained in the assignment instructions, by placing the line #pragma omp parallel for immediately before the outermost loop.
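As a sketch of what this looks like, assuming a conventional i-j-k naive loop (the function name `parallel_matmul` and its signature are chosen for this example and may differ from main_ans.cpp):

```cpp
#include <cstdint>

// A is m x n, B is n x p, C is m x p (row-major).
// Each iteration of the outer loop writes a distinct row of C, so the
// iterations are independent and can be shared among threads safely.
void parallel_matmul(float *C, const float *A, const float *B,
                     uint32_t m, uint32_t n, uint32_t p)
{
    #pragma omp parallel for
    for (uint32_t i = 0; i < m; ++i) {
        for (uint32_t j = 0; j < p; ++j) {
            float sum = 0.0f; // private to the thread: declared inside the loop
            for (uint32_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * p + j];
            }
            C[i * p + j] = sum;
        }
    }
}
```

With GCC or Clang the pragma only takes effect when the program is compiled with -fopenmp; without that flag it is ignored and the function runs sequentially.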

I also tried varying the number of threads used by the parallel implementation by setting the OMP_NUM_THREADS environment variable. Up to 4 threads, the speedup grows linearly with the thread count; beyond that, it stays a bit over 4x. This is probably because my CPU has 4 physical cores, so extra threads add scheduling overhead without adding any computing capacity.
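When experimenting with OMP_NUM_THREADS, it can help to confirm which thread count the program actually sees. The helper below is a small sketch with a hypothetical name, `available_threads`; it relies on the standard OpenMP runtime call omp_get_max_threads and falls back to 1 when the program is built without OpenMP.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the maximum number of threads a parallel region may use.
// With OpenMP enabled this reflects OMP_NUM_THREADS when it is set;
// without OpenMP the code is sequential, so the answer is 1.
int available_threads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Printing this value at the start of a benchmark run makes it easy to check that each measurement used the intended number of threads.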

Results and benchmarks

After implementing both optimizations, I wrote a simple shell script that runs my program several times on each of the provided matrices. The script is available in the benchmark.sh file. The results are shown in the following table:

| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
|---|---|---|---|---|---|---|
| 0 | 64x64x64 | 0.000716916 | 0.000267748 | 0.00418022 | 2.69936 | 0.316059 |
| 1 | 128x64x128 | 0.00276601 | 0.000983891 | 0.00202219 | 2.81429 | 1.67944 |
| 2 | 100x128x56 | 0.00192184 | 0.000693677 | 0.00407294 | 2.79296 | 0.697054 |
| 3 | 128x64x128 | 0.00278392 | 0.00099082 | 0.00232068 | 2.81336 | 1.61629 |
| 4 | 32x128x32 | 0.000362589 | 0.000133682 | 0.00373411 | 2.74501 | 0.302438 |
| 5 | 200x100x256 | 0.0140396 | 0.00496904 | 0.0047582 | 2.82651 | 3.2177 |
| 6 | 256x256x256 | 0.0455719 | 0.0154318 | 0.0108826 | 2.95349 | 4.32004 |
| 7 | 256x300x256 | 0.0536304 | 0.0185423 | 0.0134807 | 2.89548 | 4.17668 |
| 8 | 64x128x64 | 0.00143829 | 0.000509443 | 0.0031398 | 2.84041 | 0.887305 |
| 9 | 256x256x257 | 0.0456388 | 0.0158858 | 0.0107711 | 2.88015 | 4.36675 |

The table shows that the blocked implementation provides a significant speedup over the naive implementation: about 2.8x on average, and between 2.7x and 3.0x on every test case. The parallel implementation reaches higher peaks, a bit over 4x on the largest matrices (test cases 6, 7 and 9), but its speedup is not consistent. The arithmetic mean of the Parallel Speedup column is only about 2.2x, and on the smallest matrices (test cases 0, 2, 4 and 8) the parallel version is even slower than the naive one. This is most likely because the overhead of creating and synchronizing threads is roughly fixed, so it dominates when the multiplication itself takes well under a millisecond; the number of available CPU cores also caps the achievable gain. The blocked version, on the other hand, consistently outperforms the naive version with nearly the same speedup on every test case, which indicates that the optimization is effective across the range of matrix sizes tested.
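For reference, the speedup figures are ratios of the naive time to the optimized time, and an average can be taken as the arithmetic mean of those ratios. The helpers below are a small sketch with hypothetical names (`speedup`, `mean_speedup`); they are not part of benchmark.sh or main_ans.cpp.

```cpp
#include <numeric>
#include <vector>

// Speedup of an optimized run relative to a baseline run (both in seconds).
// A value below 1 means the "optimized" version was slower.
double speedup(double baseline_s, double optimized_s)
{
    return baseline_s / optimized_s;
}

// Arithmetic mean of a list of speedups; returns 0 for an empty list.
double mean_speedup(const std::vector<double> &values)
{
    if (values.empty()) {
        return 0.0;
    }
    const double total = std::accumulate(values.begin(), values.end(), 0.0);
    return total / static_cast<double>(values.size());
}
```

Because a few large ratios can dominate an arithmetic mean, it is worth reading the average alongside the per-case values, especially for the parallel column where the ratios range from well below 1 to above 4.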
